Cp bench by sudhakarsingh27 · Pull Request #5 · sudhakarsingh27/TransformerEngine

sudhakarsingh27 · 2026-05-01T04:33:17Z

Move benchmarking infrastructure for CP attention onto a dedicated branch so it persists outside of stash. The core test suite (test_attention_with_cp.py) stays focused on correctness; this branch layers benchmark/profile/stress configs and a cross-backend consistency check on top.

run_attention_with_cp.py changes (worker side):
- thd_seqlen_pattern arg supports max/half/linear/alternating/random and explicit comma-separated lengths, so benchmark configs can pin a specific variable-length workload instead of randomizing per-run.
- benchmark arg drives a 10-warmup + N-iter timing loop wrapped in cudaProfilerStart/Stop and prints ms/iter for nsys/ncu workflows.
- torch.manual_seed(1234) for reproducibility across runs.
- CP_CROSS_BACKEND_SAVE_DIR env saves per-rank inputs/outputs as .pt for the cross-backend consistency test to compare without re-running.
- Soft import from benchmark_cp so the worker can resolve names like cp_thd_0, bench_8k, bariamis_8k, rl16k without test_attention_with_cp.py needing to know about them.

benchmark_cp.py (new):
- Stress configs (cp_thd_0..3, cp_thd_swa_0..3): higher batch/longer seqlen than the core suite.
- Llama3-8b-shaped configs (bench_8k/16k/32k).
- Variable-length training-workload configs (rl16k, bucket32k/64k/128k, mixed32k, outlier64k) with per-config thd_seqlen_pattern.
- Worker-only configs (bariamis_*, bench_84992/86016) for manual invocation against the AG spike investigation log shapes.
- test_cp_thd_cross_backend_consistency: runs each backend (p2p/all_gather/a2a) on the same input, saves outputs via CP_CROSS_BACKEND_SAVE_DIR, and asserts pairwise agreement within atol=0.1.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
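For readers unfamiliar with the cross-backend check described in this commit, the pairwise comparison can be sketched roughly as follows. This is a minimal illustration, not the code in benchmark_cp.py: the helper names (`max_abs_diff`, `check_pairwise`) and the flat lists of floats are hypothetical stand-ins for the per-rank .pt tensors the real test loads, and only the backend names and the atol=0.1 tolerance come from the commit message.

```python
from itertools import combinations

ATOL = 0.1  # tolerance quoted in the commit message


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def check_pairwise(outputs, atol=ATOL):
    """Compare every pair of backends; return the pairs whose outputs differ by more than atol.

    `outputs` maps a backend name to a flat list of floats (hypothetical layout;
    the real test is assumed to compare saved tensors instead).
    """
    failures = []
    for a, b in combinations(sorted(outputs), 2):
        diff = max_abs_diff(outputs[a], outputs[b])
        if diff > atol:
            failures.append((a, b, diff))
    return failures


# Made-up values for illustration only: a2a is deliberately off in the last element.
outputs = {
    "p2p": [0.10, -1.20, 3.00],
    "all_gather": [0.12, -1.18, 3.05],
    "a2a": [0.10, -1.20, 3.50],
}
failures = check_pairwise(outputs)
for a, b, diff in failures:
    print(f"{a} vs {b}: max abs diff {diff:.3f} exceeds atol={ATOL}")
```

Comparing all pairs, rather than each backend against a single reference, means a disagreement points at the odd backend out: in the made-up data above, both failing pairs involve a2a.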

Add 18 SWA training workload configs (6 real workloads × 3 windows) to benchmark_cp.py for benchmarking sliding-window attention with context parallelism.

Replace the old single-GPU FusedAttn vs FlashAttn benchmark script with a README documenting full benchmark results (full causal + SWA, cp=2/4/8, p2p/all_gather/a2a) and individual config runner usage.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
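The 6 × 3 grid can be pictured with a short sketch. Several things here are assumptions rather than facts from the PR: that the six workloads are the variable-length training configs named in the first commit, that the three windows are the 512/1024/2048 sizes mentioned in the re-run commit, the `<workload>_swa<window>` naming scheme, and the `(left, right)` window tuple with a right window of 0 for causal attention. The actual entries in benchmark_cp.py may be named and structured differently.

```python
from itertools import product

# Assumed to be the six "real" workloads (taken from the first commit's config list).
WORKLOADS = ["rl16k", "bucket32k", "bucket64k", "bucket128k", "mixed32k", "outlier64k"]
# Assumed window sizes (taken from the SWA{512,1024,2048} note in the re-run commit).
WINDOWS = [512, 1024, 2048]


def swa_config_grid():
    """Build one hypothetical config entry per (workload, window) pair."""
    grid = {}
    for workload, window in product(WORKLOADS, WINDOWS):
        name = f"{workload}_swa{window}"  # hypothetical naming scheme
        grid[name] = {
            "base": workload,
            "window_size": (window, 0),  # assumed (left, right) layout, causal
        }
    return grid


grid = swa_config_grid()
print(len(grid), "configs, e.g.", sorted(grid)[:3])
```

Generating the grid from two short lists keeps the workload set and the window set in one place each, so adding a window size would extend every workload at once.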

Re-ran all 6 real-training configs (full causal + SWA{512,1024,2048}) on a second 8x H100 node with cuDNN 9.21 / NCCL 2.29.7 and replaced the prior results tables.

cp=2 was re-run serially because running 4 jobs concurrently on a single node distorted a2a SWA timings by roughly 2x and triggered intermittent cudaErrorIllegalInstruction on AG SWA configs.

The bucket128k SWA AG cp>=4 'FAIL' matrix seen on the original node does not reproduce on the new node, but a smaller intermittent-crash failure mode (cp=2 SWA AG under heavy concurrency) was observed. It is documented as a known issue, with the serial run as the workaround.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

sudhakarsingh27 added 2 commits April 30, 2026 13:01

sudhakarsingh27 force-pushed the cp_bench branch from 67cfbef to 0497cc8 on May 1, 2026 04:35

sudhakarsingh27 wants to merge 3 commits into cp_thd_swa_with_ag from cp_bench
